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FINAL  REPORT 

MULTILEVEL  DECOMPOSITION  METHODS 
FOR  FLUID  PLASMA  MODELS* 

S.  MCCORMICK,  K.  BECKWITH,  L.  OLSON 

1.  Multilevel  Decomposition.  Multigrid  methods  are  often  well-suited  for 
large-scale  scientific  computing  problems  because  they  offer  the  possibility  of  solv¬ 
ing  equations  at  a  cost  that  is  optimal  in  the  sense  that  is  in  small  proportion  to  the 
number  of  unknowns.  However,  the  challenge  for  multigrid  and  other  matrix  equa¬ 
tion  solvers  on  large  parallel  machines  is  that  performance  can  suffer  from  the  high 
cost  of  communication  relative  to  that  of  computation.  While  Algebraic  MultiGrid 
(AMG  [1,  2])  solvers  scale  nearly  optimally,  they  too  are  increasingly  affected  by  rel¬ 
ative  communication  costs  as  the  number  of  processors  increase.  Indeed,  all  known 
parallel  multigrid  algorithms  experience  degrading  communication  costs  for  the  main 
computation  in  each  V-Cycle.  Previous  geometric  multilevel  approaches  developed  by 
[3,  4,  5,  6,  7,  8,  9,  10,  11]  aimed  to  control  this  cost  by  trading  communication  for  com¬ 
putation  using  redundant  processing  on  overlapping  grids.  Inspired  by  these  efforts, 
the  Algebraic  MultiGrid  Domain  (AMG-DD)  and  Range  Decomposition  (AMG-RD) 
methods  developed  in  this  project  achieve  the  same  goal  by  tasking  possibly  other¬ 
wise  idle  processors  to  perform  redundant  computations  via  a  domain  decomposition 
approach.  The  basic  idea  is  to  use  subdomains  that  fully  overlap  at  coarse  scales. 
The  departure  from  the  previous  geometric  approaches  is  to  exploit  the  benefits  of 
domain  decomposition  in  a  purely  algebraic  AMG  setting. 

AMG-DD  and  AMG-RD  first  assume  that  the  setup  for  an  effective  AMG  im¬ 
plementation  has  been  formed  [1,  2].  The  two  methods  then  use  the  global  AMG 
hierarchical  constructs  (coarse  grids,  coarse-grid  operators,  and  intergrid  transfer  op¬ 
erators)  to  create  a  composite  grid  for  each  processor.  Each  composite  grid  consists  of 
the  original  grid  in  and  about  the  subdomain  owned  by  its  associated  processor  and  of 
grids  that  are  increasingly  coarse  as  they  extend  away  from  the  processor  subdomain 
to  the  boundary.  These  composite  grids  are  formed  algebraically  and  directly  from 
the  hierarchical  constructs  determined  in  the  traditional  AMG  setup  phase.  In  this 
way,  AMG-DD  and  AMG-RD  can  be  thought  of  as  globally  overlapping  domain  de¬ 
composition  methods  that  have  reduced  communication  per  cycle,  since  they  do  not 
require  communication  on  each  grid  level  as  standard  AMG  methods  do.  Moreover, 
the  new  process  provides  an  efficient  communication  phase  between  each  cycle,  in  con¬ 
trast  to  the  expensive  all-to-all  communication  processes  of  the  previous  geometric 
approaches.  The  composite  grids  thus  provide  a  means  for  maintaining  effective  com¬ 
munication  between  processors  that  controls  cost  and  maintains  optimal  convergence 
rates. 

The  main  focus  of  this  effort  was  the  development  and  study  of  AMG-DD  and 
AMG-RD  as  possible  alternatives  to  existing  AMG  approaches  for  solving  large-scale 
matrix  equations  on  advanced  parallel  machines,  with  fluid  plasma  models  as  an 
ultimate  target.  To  this  end,  some  theoretical  properties  of  these  methods  were  es¬ 
tablished  and  serial  numerical  tests  of  their  convergence  properties  were  analyzed  over 
a  spectrum  of  parameters  on  a  model  problem.  Also  studied  were  some  heuristics  and 
a  parameter  influences  based  on  a  performance  model,  both  of  which  were  designed 
to  anticipate  the  potential  of  AMG-DD  and  AMG-RD  for  use  on  emerging  parallel 
architectures. 
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Since  AMG-DD  and  AMG-RD  are  constructed  on  top  of  an  existing  AMG  hi¬ 
erarchy,  the  cost  of  the  setup,  as  in  all  AMG  algorithms,  needs  to  be  addressed.  In 
a  purely  serial  setting,  setup  costs  for  AMG-DD  and  AMG-RD  can  be  up  to  about 
twice  that  of  standard  AMG  due  to  the  redundant  calculation  that  we  use.  How¬ 
ever,  in  parallel,  owing  to  the  construction  of  the  algorithm  from  existing  operators, 
the  increased  setup  cost  can  be  reduced  essentially  to  the  cost  of  communicating  the 
components  of  the  operators  needed  to  each  processor.  This  communication  can  be 
done  using  the  same  pattern  as  the  residual  communication  and,  as  such,  is  bounded 
in  cost  by  that  of  performing  one  extra  V-Cycle. 

The  numerical  test  that  were  performed  confirmed  that,  for  current  parallel  ar¬ 
chitectures,  trading  of  communication  for  computation  that  AMG-DD  achieves  is 
advantageous  for  current  and  anticipated  architectures  in  a  very  large  computational 
environment.  The  results  showed  that  these  new  methods  can  be  especially  effective 
when  used  in  a  nested  iteration  process  where  the  models  are  first  resolved  on  crude 
levels  to  provide  good  initial  data  for  increasingly  finer  resolutions. 

2.  Adaptive  AMG.  A  coupled  system  of  elliptic  partial  differential  equations 
can  be  treated  by  iterating  in  a  sequential  or  simultaneous  way  on  the  individual 
scalar  equations,  provided  the  coupling  is  not  too  strong.  This  is  the  approach  Chacon 
[12,  13,  14]  takes  in  the  solver  for  his  MHD  simulations,  and  it  provided  the  setting 
for  the  project’s  initial  development  and  testing  of  the  performance  and  scalability  of 
multilevel  decomposition  solvers.  However,  even  with  fairly  weak  coupling  between 
the  individual  equations,  it  would  be  more  effective  to  apply  the  multigrid  solvers  to 
the  full  system.  The  difficulty  in  treating  a  coupled  system  in  this  way  is  the  need 
to  develop  coarsening  strategies  that  properly  address  the  coupling,  and  this  can  be 
especially  challenging  for  the  automatic  coarsening  strategies  that  form  the  basis  for 
algebraic  multigrid  and  smoothed  aggregation.  To  enable  a  full  and  potentially  more 
effective  multigrid  solver  for  the  two-fluid  system,  the  project  therefore  focused  on 
extending  the  adaptive  version  of  SA  (aSA).  The  central  aim  here  was  to  develop  an 
efficient  algebraic  solver  that  could  be  applied  to  plasma  fluid  systems  with  algorith¬ 
mically  optimal  performance  in  the  sense  that  its  cost  is  essentially  proportional  to 
the  size  of  the  discrete  system.  Such  an  optimal  adaptive  solver  could  then  be  used 
with  the  multilevel  decomposition  approaches  to  enable  simulations  at  ultra-scales 
that  would  otherwise  be  derailed  by  the  growing  solver  costs. 

The  basic  idea  behind  aSA  is  to  aim  for  a  coarsening  strategy  that  guarantees  the 
so-called  weak  approximation  property,  which  states  that  the  coarse  approximation  to 
a  given  fine-grid  error  in  the  Euclidean  norm  is  smaller  than  some  constant  times  the 
energy  norm  of  that  error  (with  the  constant  assumed  to  be  inversely  proportional  to 
the  norm  of  the  matrix).  This  property  guarantees  certain  convergence  properties  of 
unsmoothed  aggregation  and,  with  subsequent  smoothing  of  interpolation,  it  tends  to 
produce  an  optimally  convergent  SA.  In  any  case,  this  aim  is  achieved  by  starting  with 
an  error  that  is  certain  to  be  in  need  of  a  coarse-grid  correction,  which  is  accomplished 
by  simply  applying  the  initial  SA  scheme  to  the  homogeneous  equation  beginning 
with  a  random  initial  guess.  If  slow  convergence  is  observed,  then  the  interpolation 
operator  is  adjusted  to  improve  its  approximation  to  the  resulting  error.  This  process 
is  repeated  until  good  convergence  is  achieved. 

Initial  tests  with  this  new  version  of  aSA  have  been  very  encouraging.  With 
just  one  or,  in  some  cases,  two  adaptive  cycles,  convergence  factors  are  achieved  that 
compare  to  model  problem  results.  The  tests  suggest  that  aSA  should  provide  a  very 
effective  solver  for  systems,  especially  for  coupled  plasma  models. 
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3.  Jacobian- Free  Newton-Krylov  Methods  for  Fluid  Plasmas.  Jacobian- 
Free,  Newton-Krylov  (JFNK)  methods  are  based  on  the  Newton-Krylov  methods, 
which  were  first  considered  in  the  work  of  Brown  and  Saad  [15,  16].  The  early  advances 
involved  the  realization  that  inexact  Newton  methods  were  simpler  and  cheaper  to 
implement  and  exhibited  similar  convergence  to  exact  Newton  methods.  Researchers 
also  explored  preconditioning  strategies,  such  as  the  Schwarz  methods  [17],  within 
the  Newton-Krylov  systems.  A  detailed  history  and  thorough  study  of  the  Jacobian- 
Free,  Newton-Krylov  methods  is  presented  by  Knoll  and  Keyes  in  [18].  The  JFNK 
methods  have  since  been  applied  to  a  range  of  problems  from  power  systems  [19]  to 
MHD  systems  [12,  13,  14]. 

Here,  the  JFNK  method  is  used  to  solve  the  nonlinear  system  of  equations  gen¬ 
erated  by  a  fully  implicit  temporal  discretization  of  the  constitutive  equations,  which 
exhibit  either  a  hyperbolic,  parabolic  or  elliptic  structure,  depending  of  the  plasma 
regime.  A  nonlinear  system  F(x)  =  0  must  be  solved  for  the  new  solution,  x,  at 
each  new  time  step.  The  solution  of  the  nonlinear  system  follows  Newton’s  method 
whereby  the  Jacobian  system  of  the  form 


<9F 

dx 


5xk 

k 


-F(xfc), 


(3.1) 


where  x^+i  =  xk  +  6xk  is  the  kth  iteration. 

The  JFNK  method  employs  an  inner  Krylov  method  as  the  linear  solver  for  the 
Jacobian  system  and  an  outer  Newton’s  method  for  the  nonlinear  problem.  The  key 
to  the  approach  is  that  the  inner  Krylov  method  requires  only  matrix- vector  products 
for  the  Jacobian  (linear)  system.  This  can  be  approximated  with  the  approximate 
Gateaux  derivative  as 


<9F 

dx 


v 

k 


F(xfc  +  ev)  -  F(xfc) 
e 


(3.2) 


The  error  in  this  approximation  is  proportional  to  e.  The  result  of  this  approximation 
is  a  matrix-free  implementation  of  the  matrix-vector  multiply;  however,  the  entire 
method  will  not  be  matrix-free  because  a  preconditioner  is  needed  to  have  reasonable 
convergence  and,  for  wide  applicability,  the  preconditioner  will  be  built  based  in  an 
algebraic  way  on  a  matrix.  In  this  project,  we  have  utilized  JFNK  solvers  developed 
at  Sandia  National  Laboratory  as  part  of  the  Trilinos  library  [20],  specifically  the 
Nonlinear  Object-Orientated,  Solutions  (NOX)  framework  [21,  22],  which  combines 
the  AztecOO  iterative  solvers  package  [23]  with  algebraic  multigrid  preconditioners 
available  within  the  MultiLevel  package  (ML,  [24]).  These  solvers  have  been  integrated 
into  the  Tech-X  USlM  tool  for  fluid-plasma-electromagnetic  simulations,  developed 
with  prior  support  from  AFOSR  under  grants  FA9550-12-C-0039  and  FA9550-14-C- 
0004  and  the  Department  of  Energy  under  grants  DE-SC0000833,  DE-SC0009585  and 
DE-AR0000566.  This  approach  has  allowed  us  to  rapidly  prototype  different  physics- 
based  preconditioners  to  develop  fully-implicit  solvers  for  plasma  physics  problems. 

3.1.  Preconditioners  for  Moving  Least  Squares  Problems.  The  algorithms 
for  solving  fluid  plasma  equations  in  the  USlM  code  rely  on  moving  least  squares  opera¬ 
tors  for  interpolation  of  fluid  quantities  on  the  computational  mesh  [25] .  As  a  first  step 
in  developing  preconditioners  for  fluid-plasma  problems,  we  investigated  constructing 
preconditioners  for  a  prototype  Laplace  problem  based  on  this  class  of  operators  in 
order  to  both  verify  that  these  operators  are  amenable  to  multigrid  preconditioning 
and  to  demonstrate  the  overall  scalability  of  the  NOX  framework  integrated  into  the 
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Fig.  3.1:  Scalability  of  ML  preconditioned  JFNK  solver  for  a  moving  least  squares 
discretization  of  a  Laplace  operator.  The  left  panel  shows  the  number  of  inner  Krylov 
solves  for  the  first  and  second  outer  Newton  iteration;  the  right  panel  shows  the 
preconditioner  setup  time,  the  time  taken  for  the  first  Newton  iteration,  the  time 
taken  for  the  second  Newton  iteration  and  the  total  time  to  solution.  Overall,  the 
approach  shows  good  weak  scalability. 


USiM  tool.  To  introduce  the  preconditioned  moving  least  squares  algorithm,  consider 
two-dimensional  data  Qi  on  N  points  (xi,  yt)  with  weights  Wi  where  i  =  0, ...,  N.  The 
Vandermonde  matrix  for  this  problem  takes  the  form: 


Wo  1  W0x0 

W\  1  W\X  1 

w0x0y0  w0x  l  Woyl 
wxxiyi  w  ix\  wiy\ 

do 

ai 

1 

O  >-l 

WnI  WnXn 

wNxNyN  wnx2n  wNy% 

aN 

qN 

This  equation  is  of  the  form  PA  =  Q ,  with  solution  A  =  [PTP]  1  PTQ ■  Let  B  = 
[PTP]  1  PT,  then  at  a  point  (x,y)  we  can  write  (using  tensor  notation): 


q(x,y)  =  BappaQP 
d'q{x-v)  = 


dx1 


dxi 


P  =  (l,a :,y,xy,x2,y2)J 


(3.4) 

(3.5) 

(3.6) 


In  two-dimensions ,  we  can  then  construct  a  finite  volume  discretization  of  the  Laplace 
operator  through: 


V2Q  = 


dxdy 


dq  (x,y) 
dx 


nd£  = 


dxdy 


Bn 


d  lp° 
1  dx1 


QV  ■  ndl 


(3.7) 


where  h  is  the  outwards  facing  normal  at  each  cell  edge  and  dt  is  the  length  of  the 
edge.  In  three-dimensions,  this  operator  becomes: 


y2Q  =  dv. 


dq  (x,  y,  z) 
dx 


hdA=dV 


Bn 


d  ipa 
1  dxi 


Q13  ■  ndA 


(3.8) 


The  matrix  used  to  precondition  this  operator  is  then  constructed  from: 


Pk 


1 

dV 


Pa/3 


dlpa 
dx 1 


•  fidA 
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This  is  then  applied  within  the  Jacobian-Free  Newton  Krylov  solver  as  a  right  pre¬ 
conditioner: 


<9F 

<9x 


Pk  1Pk6xk 
k 


-F(xfc), 


(3.10) 


We  construct  P 'k1  using  Trilinos  ML  Uncoupled  Smoothed  Aggregation  with  5 
levels;  Jacobi  smoothing  is  utilized  on  the  coarsest  level  and  Gauss  Seidel  smoothing 
on  all  others.  The  scalability  of  this  approach  for  a  simple  three-dimensional  lin¬ 
ear  Poisson-type  problem  (e.g.  V20  =  p  with  p  held  constant)  is  shown  in  Figure 
3.1.  The  data  of  these  figures  demonstrate  that  the  implementation  of  the  Trilinos 
infrastructure  within  the  USim  tool  exhibits  good  scalability  for  three-dimensional 
problems. 


3.2.  Physics-Based  Preconditioners  for  Adiabatic  Magnetohydrodynam¬ 
ics.  For  density,  momentum,  and  temperature,  the  resistive  MHD  equations  are  given 

by 


dp 

dt 


+  V  •  (pv)  =  0 


d(pv) 

dt 


+  V  •  (pvv  +  Ip  +  n)  =  J  x  B 


dT 

—  +  v  •  VT  +  ( 7  -  1)TV  •  v  =  0. 

The  evolution  of  the  magnetic  field  is  described  by: 

<9B 


dt 


=  —V  xvxB;  VxB  =  p0J;  V  •  B  =  0 


These  equations  can  be  cast  in  conservative  form  as  dtU  +  V  •  F  ( U ) 


(3.11) 

(3.12) 

(3.13) 


(3.14) 

S(U),  with: 


p 

pv 

'  o  ' 

u  = 

pv 

E 

;  F(U)  = 

pvv  +  (p  +  -jB  •  B)  I  —  BB 
(E  +  p+  |B  -"b)  v  —  B  (B  •  v) 

;  s(u)  = 

0 

0 

B 

"vB  Bv 

0 

(3.15) 

where  E  =  is  the  total  energy  of  the  fluid.  The  constraint,  V  •  B  =  0 

requires  special  mention;  physically  accurate  solution  of  eqn.  3.11  through  3.14  re¬ 
quires  that  this  constraint  is  enforced  during  the  evolution.  Approaches  such  as  con¬ 
strained  transport  [26]  have  been  developed  to  enforce  this  constraint  to  machine  pre¬ 
cision;  however,  these  methods  typically  do  not  extend  to  implicit  time  discretization 
or  high  order  schemes.  Instead,  we  adopt  Generalized  Lagrange  Multiplier  schemes 
due  to  [27],  where  the  MHD  system  is  augmented  to  include  a  scalar  potential,  ip  that 
acts  to  advect  divergence  errors  out  of  the  grid  at  a  speed  Ch  (corresponding  to  the 
fastest  hyperbolic  wave  speed  in  the  simulation  domain)  while  simultaneously  damp¬ 
ing  these  errors  to  zero  at  a  rate  c2/c2  (where  c2  is  a  problem  dependent  constant) 
[28].  With  this  augmentation,  the  MHD  system  becomes: 


u  = 

1 

1 _ 

;  F(U)  = 

pv 

pvvT  +  (p  +  bB  •  B)  I  —  BBt 
(pE  +  p+  -jB  -"b)  v  —  B  (B  ■  v) 

S(U)  = 

0 

0 

0 

B 

vBt  -  Bvt  +  ip 
c2B 

-pt>p 

L  p  J 

(3.16) 
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+  Ulixes  JFNK 
□  Ulixes  Explicit 


Fig.  3.2:  Convergence  of  a  multidimensional  circularly  polarized  Alven  wave  test  due 
to  [29].  The  left  panel  shows  the  current  normal  to  the  plane  after  one  complete 
traversal  of  the  grid.  The  left  panel  shows  the  norm  of  the  L2-error  as  a  function 
of  grid  resolution,  including  divergence  errors  for  an  explicit  second  order  Runge- 
Kutta  discretization  compared  to  the  fully  implicit  scheme  described  here.  Both 
discretizations  demonstrate  second  order  convergence  with  comparable  errors. 


The  addition  of  these  terms  leads  to  an  augmented  system  that  contains  two  addi¬ 
tional  modes,  carrying  jumps  in  the  normal  component  of  B  and  if).  These  waves  are 
decoupled  into  a  linear  hyperbolic  system  that  can  be  solved  analytically  [28] : 

dtBx  =  -dxip;  dtip  =  -c2hdxBx  (3.17) 

To  solve  dtU  +  V  •  F  ( U )  =  S(U)  using  JFNK  methods,  we  cast  the  system 
equations  as  a  non-linear  functional,  F(f7)  =  dtU  +  V  •  F  (U)  —  S(U).  Adopting  a  9 
temporal  discretization  [12,  13,  14],  we  have: 

F([/n+1)  =  Un+1  -Un  +  9At  [V  •  F  ( Un+1 )  -  S  (t/n+1)] 

+  (1  —  9)  At  [V  •  F  (Un)  -  S  (Un)}  ^3'18 


where  9  =  0, 1  corresponds  to  forward/backward  Euler  respectively,  whereas  9  =  0.5 
corresponds  to  Crank-Nicholson.  To  precondition  this  non-linear  system,  we  linearize 
the  apprxomate  Gateaux  derivative  (Eqn.  3.2)  acting  on  the  above  equation,  yielding: 


OF 

<9x 


SUk 

k 


jl  -  9 At 


dF  (17) 
dU 


dS(U)' 

dU 


SUk 


(3.19) 


We  again  apply  this  approximate  Jacobian  through  right  preconditioning  (Eqn.  3.10), 
requiring  the  computation  of  the  inverse  of  Eqn.  3.19.  In  order  to  compute  the  inverse, 
we  note  that  has  a  trivial  answer  for  the  system  of  equations  (3.16).  Determing 

the  form  of  js  more  complex.  USlM  makes  use  of  finite  volume  discretizations 

and  discontinuous  Galerkin  discretizations  as  described  in  [25].  As  implemented  in 
USlM,  both  of  these  discretization  methods  make  use  of  Riemann  solvers  to  compute 
self-similar  solutions  to  the  breakdown  of  a  discontinuity.  The  finite  volume  scheme 
(for  example)  then  integrates  this  solution  over  a  control  volume  to  construct  a  second 
(or  higher)  order  accurate  approximation  to  V  •  F  ( U )  though  (e.g.): 


F(Ul,Ur ) 


v ' F  (t/)  =  W  /  f('Uli  Ur)  ' MA 


2  L 


F{Ul)  +  F{Ur) 


A 


(UL  -  UR) 


(3.20) 

(3.21) 
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Fig.  3.3:  Evolution  of  a  weakly  magnetized  field  loop  due  to  [32,  33].  This  test  verifies 
that  the  divergence  of  the  magnetic  field  remains  zero  on  the  computational  stencil; 
non-zero  magnetic  field  divergence  causes  rapid  growth  of  magnetic  energy  and  the 
destruction  of  the  loop.  The  left  panel  shows  the  structure  of  the  magnetic  energy  on 
the  computational  domain  after  one  crossing;  the  right  panel  shows  the  evolution  of 
the  magnetic  energy  integrated  over  the  simulation  domain  over  one  crossing  period. 
The  data  of  these  two  panels  demonstrate  that  the  magnetic  field  loop  remains  well 
defined  and  that  the  magnetic  energy  decays  due  to  numerical  diffusion,  i.e.  the 
divergence  of  the  magnetic  field  is  zero  on  the  computational  stencil. 


where  we  have  adopted  a  Roe-type  solver  [30]  to  compute  a  self-similar  upwind  flux 
from  initial  states  at  the  left-  ( Ul )  and  right-  {Ur)  hand  sides  of  the  discontinuity. 
The  reconstruction  of  the  initial  data,  U,  located  at  control  volume  centers  to  form 
Ul  and  Ur  is  performed  by  the  moving  least  squares  algorithm  described  in  §3.1  and 
coefficients  for  this  data  are  included  in  the  matrix  for  forming  Ul  and  Ur.  Finally, 


the  dissipation  matrix, 


A 


A\  =  L|/(A)|I?,  where  L,R  are  the 


is  formed  through 

left  and  right  eigenvectors  of  the  system  and  we  have  applied  an  entropy  fix  to  the 
eigenvalues,  A  to  render  the  flux  differentiable.  To  construct  9fiAAjUr'1  ;  we  follow  the 
approach  described  by  [31]: 


dF  {UL,  Ur)  =  1 
dUL  2 


dF{UL) 

dUL 


A 


dF  {UL,  Ur)  =  1 
dUR  2 


\0F{Ur) 

A 

dUR 

(3.22) 


The  flux  jacobians,  dF(FL,un.)  ancj  c)F(Ul,Ur )  are  constructed  from  the  same  eigensys- 


duL  duR 

tern  (with  the  entropy  fix  applied)  as  the  dissipation  matrix, 


A 


The  eigensystem 


for  the  augmented  MHD  system  (3.16)  is  well  known  (e.g.  [29,  28])  and  can  be  used 
directly  to  construct  the  matrix  associated  with  Eqn.  3.22. 

To  precondition  the  MHD  system,  we  construct  P [71  using  Trilinos  ML  Uncoupled 
Smoothed  Aggregation  with  5  levels;  on  each  level,  we  use  block  ILU  smoothing  with 
zero  overlap  and  symmetric  Gauss-Seidel  relaxation  on  each  level  and  choose  the  block 
size  to  be  equal  to  the  number  of  partial  differential  equations  in  the  system  (here,  this 
is  9).  In  Fig.  3.2  we  demonstrate  that  this  our  approach  exhibits  second  order  accuracy 
for  propagation  of  a  non-linear  circular  polarized  Alfven  wave  in  multidimensions  due 
to  [29].  To  our  knowledge,  this  is  the  first  demonstration  of  a  fully  implicit  algorithm 
for  magnetohydrodynamics  with  second-order  accuracy.  In  Fig.  3.3  we  utilize  the  field 
loop  advection  test  due  to  [32,  33]  to  demonstrate  that  that  our  algorithm  preserves 
the  divergence  of  the  magnetic  field  in  multi-dimensions.  In  Figure  3.4  we  demonstrate 
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MHD  KHI:  Second  Newton  Step 


MHD  KHI:  First  Newton  Step 

o.uuctu  I  - 

1.30E+01  - 1  +  2nd  Order  O  1st  Order 


0  175.000  350.000  525.000  700.000  0  175.000  350.000  525.000  700.000 

#  of  Unknowns  #  of  Unknowns 


Fig.  3.4:  Scalability  of  a  range  of  different  preconditioning  strategies  for  JFNK  based 
solution  of  the  compressible  MHD  equations  used  to  compute  the  linear  growth  stage 
of  the  magnetized  Kelvin-Helmholtz  instability  as  a  function  of  problem  size.  The 
simulation  timestep  chosen  so  that  highest  resolution  requires  2  outer  Newton  iter¬ 
ations.  The  left  and  right  panel  shows  the  scaling  of  the  number  of  inner  Krylov 
iterations  with  problem  size  for  the  first  and  second  outer  Newton  iterations  respec¬ 
tively.  When  combined  with  adaptive  linear  convergence  criteria,  the  JFNK  system 
with  second  order  preconditioning  exhibits  scalability  approaching  optimal. 


the  scalability  of  the  approach  for  a  range  of  different  reconstruction  algorithms  and 
non-linear  convergence  criteria  for  the  MHD  eigensystem  given  by  [29]:  by  setting 
the  non-linear  convergence  criteria  to  | |F(ajfc) 1 12  <  ea  +  er| |F(a;o)| (2  while  at  each 
outer  Newton  iteration  setting  the  Krylov  solver  convergence  criteria  to  ||^  + 

F(£fc)||2  <  Cfc||F(*o)||2,  we  are  able  t°  obtain  an  algorithm  that  approaches  optimal 
for  compressible  MHD. 


3.3.  Physics-Based  Preconditioners  for  the  Adiabatic  Two-Fluid  Plasma 
Equations.  Having  demonstrated  the  feasibility  of  our  approach  for  preconditioning 
the  compressible  MHD  equations,  we  now  extend  this  approach  to  electron-ion  fluid 
plasmas  with  full-wave  electromagnetics.  The  evolution  of  the  two  species  are  de¬ 
scribed  by: 


+  V  •  (psvs)  =  0 


<9(psvs) 

dt 


+  v  •  (psvsvs  + 1 ps  +  ns) 


=  nsqs  E  +  J,xB-^  P«PiAs,»(vs  -  v») 
i^s 

dCa 

+  V  •  ((es  +  ps)vs  +  vs  •  ts  +  qs) 


dt 


—  E  •  Js  psPi^s,i(Vs  vi) 


i^s 


(msvs  -  mjVj) 

(■ ms  -  rrii) 


(3.23) 

(3.24) 


(3.25) 


where  s  =  i,e  denotes  the  ‘species’  (e.g.  ions,  electrons),  ps  is  the  species  density, 
vs  is  the  species  velocity  and  es  is  the  total  energy  for  the  species.  These  equations 
are  coupled  through  collision  operators  (e.g.  JT  ,s  Ps/°?^s,i(vs  —  Vj)  in  eqn.  3.24)  and 
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through  the  electromagnetic  fields,  described  by  Maxwell’s  equations: 


9B 

~9t 


=  -V  x  E 


5E  -  „ 

—  -  c  V  x  B  =  p0J 
at 


(3.26) 

(3.27) 


In  order  to  make  use  of  the  physics-based  preconditioner  developed  in  §3.2,  we 
reformulate  the  two  fluid  plasma  model  in  terms  of  the  center-of-mass  total  density, 
total  momentum  density,  total  charge  density  and  total  current  density,  defined  as: 


p  =  5>;  dv  =  X^sV: 

s  s 

Pc  =  ^  Qsm~1ps-,  J  = 


qsms  Vsvs 


(3.28) 


With  these  definitions,  we  are  able  to  write  the  two  fluid  plasma  equations  in  MHD- 
like  form: 


U  = 


p 

pv 

E 

;  F(U)  = 

Pc 

J 

pw 

pvvT  +  pi 
(E  +  p)v 

J 

S7V  •  (psvsvs  +  Ips) 


S(U)  = 


0 

pcE  t  J  x  B 

E  J 

0 


XX  ms  +  Js  X  B  XX^s  AsPi^s,i(Vs  vi)^ 


(3.29) 


where  the  evolution  of  the  electromagnetic  fields  are  described  by  Maxwell’s  equations 
(3.26,3.27). 

The  next  step  in  reformulation  of  the  two  fluid  equations  to  make  the  amenable 
to  preconditioning  using  the  MHD  eigensytem  is  to  reformulate  the  source  terms  for 
the  total  momentum  and  energy  equations: 


S(pv)  =  pcE  +  J  x  B;  S(E)  =  E  •  J 


(3.30) 


These  can  be  rewritten  in  terms  of  conservation  of  electromagnetic  momentum  and 
energy,  using  the  standard  identities  (see,  e.g.,  Jackson,  1975,  [34]): 


PcE  +  J  x  B  =  -  4  +  V  •  Tem 

cz  ot 

E  .  J  =  _dEEM  _  v  .  Sem 


where: 


Eem  = 


2p0 


1  /EEt 

Tem  =  —  — 

Po  V  c2 


dt 

S  EM 

I  •  E 

BBt 


E  x  B 

Po 

hB  B 

■  EEm I 


(3.31) 

(3.32) 

(3.33) 

(3.34) 

(3.35) 
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where  Eem  is  the  energy  density  in  the  electromagnetic  held,  S em  is  the  Poynting 
flux  and  Tem  is  the  electromagnetic  stress  tensor.  The  remaining  source  term  is 
associated  with  the  evolution  of  the  total  current: 


S(J)  =  £ 


m. , 


nsqs  E 


Js  X  B  ~^pspi  As,,:(vs  -  v<) 

i^s 


(3.36) 


This  source  term  can  be  made  amenable  to  preconditioning  by  writing  in  terms  of  p, 
v  and  J: 


[{kpp  +  kPcpc)  (E  +  77 J)  +  (kppv  +  kPc  J)  x  B] 


(3.37) 


The  MHD-like  two  fluid  equations  written  in  conservation  law  form  are  then: 


p 

pv 

pv  +  c~2SEm 

pvvT  +  pi  -  Tem 

E  +  EEm 

;  F(U)  = 

(E  +  p)v  +  S  em 

Pc 

J 

J 

.  Es  •  (psvsvs  +  I ps) 

S(U) 


0 

0 

0 

0 

(^7)  PpP  +  kPopc)  (E  +  77 J)  +  (fcppv  +  kPc  J)  x  B] 


(3.38) 


Note  that  the  conserved  quantities  now  include  contributions  from  the  electromagnetic 
held  in  both  the  energy  and  the  momentum.  We  also  remark  that  the  contribution  of 
the  electromagnetic  held  to  the  dehnition  of  the  total  momentum  is  suppressed  by  a 
factor  c2,  while  the  coupling  of  the  net  charge  to  the  electric  held  has  been  replaced  by 
contributions  0(E2/c2).  As  such,  the,  the  equations  describing  the  evolution  of  total 
mass,  momentum  and  energy  can  therefore  be  preconditioned  neglecting  contributions 
from  the  electric  held  as  these  can  only  be  important  when  E  ~  O(c);  in  such  a  case 
of  a  relativistically  strong  electric  held,  it  is  inappropriate  to  use  the  above  set  of 
equations  and  instead,  the  equations  of  relativistic  fluid  dynamics  should  be  considered 
(e.g.  [35]). 

The  equations  of  total  charge  and  current  conservation  require  a  separate  precon¬ 
ditioning  strategy.  Here,  we  utilize  the  eigensystem  for  compressible  hydrodynamics 
(e.g.  [29])  and  write  the  preconditioner  for  this  system  as: 


<9F 

dx 


SUk 

k 


1 


eAtJ2 

s 


qs  dF  (Us)  dUs 
ms  dUs  dU 


+  9  Ai¬ 


ds  (U) 
dU 


6Uk 


U  = 


Pc 

J 


U'  = 


Ps 

PsVs 


F(US)  = 


PsVs 

psVsvJ  +  ps  I 
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Fig.  3.5:  Classic  Brio-Wu  shock  tube  in  the  MHD  (left  panel)  and  two-fluid  (right 
panel)  regimes,  both  computed  using  the  fully  implicit  two-fluid  model  in  MHD-like 
form.  The  data  of  these  figures  demonstrate  the  ability  of  the  code  to  handle  shocks 
across  a  wide  range  of  parameter  regimes. 


Fig.  3.6:  Density,  velocity  and  magnetic  field  distributions  for  the  evolution  of  a  ion- 
electron  shear  instability  using  the  fully  implicit  two-fluid  model  in  MHD-like  form. 
In  this  calculation,  the  ions  and  electrons  are  unstable  on  different  timescales,  which 
leads  to  generation  of  magnetic  islands,  demonstrating  the  ability  of  the  scheme  to 
handle  separate  ion  and  electron  dynamics. 


where  8Us/dU  can  be  determined  through  inspection  of  eqn.  3.28.  In  addition,  since 
this  system  of  equations  requires  knowledge  of  both  the  ion  and  electron  pressure 
separately,  we  evolve  an  entropy  conservation  equation  for  the  electron  fluid  to  en¬ 
able  separation  of  the  ion  and  electron  pressure  in  the  total  energy  equation.  We 
precondition  Maxwell’s  equations  through  the  eigensystem  due  to  [27],  using  a  Gen¬ 
eralized  Lagrange  Multiplier  method  described  therein,  with  the  addition  of  a  a  linear 
hyperbolic  system  that  can  be  solved  directly  for  the  electric  field.  As  for  the  MHD 
equations,  we  precondition  the  two-fluid  system  by  constructing  P [71  using  Trilinos 
ML  Uncoupled  Smoothed  Aggregation  with  5  levels;  on  each  level,  we  use  block  ILU 
smoothing  with  zero  overlap  and  symmetric  Gauss-Seidel  relaxation  on  each  level  and 
choose  the  block  size  to  be  equal  to  the  number  of  partial  differential  equations  in  the 
system  (here,  this  is  18). 
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This  approach  allows  us  to  attack  two-fluid  plasma  problems  across  a  range  of 
regimes  within  a  common  solver  framework.  The  data  of  Fig.  3.5  demonstrates  classic 
one-dimensional  shock  tubes  in  both  the  MHD  and  two-fluid  regimes,  where  in  each 
case  the  shock  is  evolved  on  the  MHD  timescale,  while  the  lightwave  is  many  orders 
of  magnitude  faster.  Next,  the  data  of  Fig  3.6  demonstrate  the  application  of  this 
approach  to  two-fluid  shear  instabilities  where  the  ions  and  electrons  are  unstable  on 
different  timescales,  which  leads  to  generation  of  magnetic  islands.  In  both  cases, 
the  system  was  evolved  on  the  MHD  timescale,  e.g.  signficantly  above  the  timescales 
associated  with  the  propagation  of  electromagnetic  waves  in  the  fluid.  Taken  together, 
these  example  problems  demonstrate  that  the  MHD-like  formulation  of  the  two  fluid 
plasma  equations  reduces  spurious  divergence  errors  in  the  electric  field  and  can  handle 
separate  ion  and  electron  dynamics.  Verification  and  validation  work  on  the  two-fluid 
plasma  model  is  ongoing  under  AFOSR  Grant  FA9550-14-C-0004  and  will  be  reported 
elsewhere  [36]. 
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3.4.  Physics-based  Preconditioners  for  the  Isothermal  Two-Fluid  Plasma 
Equations.  The  evolution  of  a  ion-electron  isothermal  plasma  is  described  by  con¬ 
servation  of  mass  and  momentum  (e.g.  eliminating  the  energy  equation  from  the 
compressible  system  described  above): 


d tna  +  V  •  pQ  =  0 , 
madtTpa  +  maV  •  Sa  —  qanaB  —  qapa  x  B  =  0, 


(3.41) 


where  ma  is  the  mass,  qa  is  the  charge,  pa  is  the  momentum  density,  na  is  the  number 
density  and  Sa  is  the  stress  tensor,  for  species  a  =  i,e.  A  simple  isothermal  closure 
is  supplied:  (maV  •  SQ)  — >  ( Ta\7na )  +  maV  •  ^P°P°  ^  ,  with  Ta  the  temperature.  The 
fluid  equations  couple  directly  to  Maxwell’s  field  equations, 


V  ■  E 

p_ 
eo  ’ 

(Gauss’  Law) 

V  x  E  +  dtB 

=  o. 

(Faraday’s  Law) 

VB 

=  0, 

(Solenoidal  Constraint) 

— -f  V  x  B 

=  Moj, 

(Ampere’s  Law) 

where  E  is  the  electric  field,  B  is  the  magnetic  field,  eg  is  the  permitivity,  po  is  the 
permeability,  p  is  the  charge  density,  and  j  is  the  current  density.  The  charge  density 
and  current  density  are  related  directly  to  the  number  density  and  momentum  density 
by  p  =  qene  +  g^n,;  and  j  =  gepe  +  g^p,.  The  time  derivatives  are  implicitly  discretized 
(dt  —>  A),  and  all  known  explicit  pieces  are  collected  on  the  right-hand  side.  Coupling 
(3.41)  with  (3.42)  produces  the  first-order  system 
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(3.43) 


3.4.1.  Darwin  Approximation.  In  non- relativistic  kinetic  simulations  the  speed 
of  light  is  spurious  and  leads  to  numerical  instabilities.  Overcoming  these  instabilities 
can  be  accomplished  by  the  injection  of  artificial  dissipation,  but  only  at  the  expense 
of  energy  conservation  [37].  These  issues  can  be  circumvented  by  taking  the  asymp¬ 
totic  limit,  c  — ►  oo,  or,  alternatively,  eo  — >  0.  Unfortunately,  this  rigorously  holds 
true  only  in  the  quasineutral  approximation.  When  coupling  to  a  non-relativistic  ki¬ 
netic  simulation,  this  is  insufficient;  charge  separation  effects  are  required.  For  these 
reasons,  a  Darwin  model  is  used.  The  model  is  an  approximation  which  decomposes 
fields  into  their  solenoidal  and  irrotational  components  in  order  to  analytically  project 
out  light  waves  while  maintaining  charge  separation  in  Gauss’  law  [38],  [39].  Without 
derivation,  the  Darwin  model  is  (see  [40],  [39]  for  a  detailed  discussion): 


-/h)eo<9iEr 

dtB 


V  x  Er 

V  •  Er 
IVxB 

V  B 
VxEs 

V  •  Es 


0, 

_p 
eo  ’ 

Moj, 

0, 

0, 

0, 
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(Definition) 

(Gauss’  Law) 

Ampere’s  law) 
(Solenoidal  Constraint) 
(Faraday’s  Law) 
(Definition) 


(3.44) 
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where  Er  is  the  irrotational  electric  field  and  Es  is  the  solenoidal  electric  field  defined 
such  that  E  =  Er  +  ES.  Note  that,  by  definition,  B  is  solenoidal.  The  main  difference 
from  Maxwell’s  equations  (3.42)  is  that  <9tEs  has  been  removed  from  Ampere’s  law. 
Coupling  (3.44)  with  (3.41)  yields  the  TFP-Darwin  model: 
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subject  to  pure  conductive  boundary  conditions, 


n  •  Vna  =  0  , 

Pa  =  0, 

n  x  (Er  +  Es)  =  0  , 

n  •  B  =  0 , 


9Er 

SB 

0 

f*E.; 

0 


(3.45) 


(3.46) 


where  n  denotes  the  outward  boundary  normal.  To  make  the  system  (3.45)  well 
posed,  the  following,  stronger  boundary  conditions  are  imposed.  The  additive  con¬ 
straint  is  decomposed  into  separate  constraints  for  both  the  irrotational  and  solenoidal 
components  of  the  electric  field, 


n  x  Er  =  0  , 

n  x  Es  =  0  . 


(3.47) 


3.4.2.  FOSLS-TFP-Darwin.  In  this  section,  each  block  of  the  TFP-Darwin 
system  is  addressed.  The  block  is  scaled  to  ensure  optimal  algebraic  multigrid  (AMG) 
performance  on  that  block.  In  the  case  of  the  fluid  blocks,  the  system  is  modified  to 
allow  H1  ellipticity  of  that  block.  Then,  the  nonlinearities  on  the  full  TFP-Darwin 
model  are  addressed. 

As  presented  in  (3.45),  the  TFP  system  does  not  naturally  admit  full  H 1  -ellipticity. 
It  can  be  shown  that  the  system  (3.45)  is  Vo-elliptic  with  pQ  €  H  (div),  nQ  €  H1,  and 
Er,B,Es  G  (n1)3.  That  is, 

v0  =  n  (Div)  <8)  n1  <g>  n  (Div)  <8  n1  ®  ( n 1)3  ®  (n1)3  <g>  ( n 1)3 . 

Posed  in  this  space,  momentum  densities  are  not  in  Si1.  Special  care  is  required  to 
ensure  multigrid  algorithms  yield  optimal  results.  One  approach  is  to  use  %  (Div)- 
conforming  finite  elements,  for  example  Raviar-Thomas  elements,  for  the  momentum 
densities  and  use  a  special  multigrid  algorithm  that  employs  distributed  relaxation 
based  on  the  support  of  the  divergence-free  basis  elements  [41,  42].  Instead,  (3.45) 
is  directly  modified  by  introducing  an  additional  constraint  that  yields  an  7-^ 1  -elliptic 
system.  The  assumption  that  pa  G  H1  is  reasonable  for  most  plasma  systems  of 
interest. 
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To  accomplish  this,  denote  the  differential  matrix  system  in  (3.45)  by  £(u)  =  f 
and  define  the  block  differential  matrix  systems  as 


"  At  0  O' 
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'  Ui  ' 

0  Ae  0 

ue 
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ue 
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P> 
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- 1 

1 

c 
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i _ 
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_ 1 

(3.48) 


where  the  first  term  is  linear  and  lAi  and  lAe  contain  the  quadratic  terms  that  appear 
in  the  upper  right-hand  blocks  in  (3.45). 

Below,  each  block  is  examined.  First,  the  Darwin  block,  V ,  is  scaled  and  shown 
to  be  H1  elliptic.  Then,  the  fluid  blocks,  At  and  Ae,  are  modified,  scaled  and  shown 
to  H1  elliptic.  Finally,  the  Frechet  derivative  of  the  entire  system  is  addressed.  In 
Section  3.4.6  the  linearized  system  is  shown  to  be  uniformly  "H1  elliptic. 

3.4.3.  Darwin  Block.  The  Darwin  block,  V,  is  naturally  ^-elliptic,  but  con¬ 
tains  scalings  that  are  dependent  upon  physical  constants.  The  isolated  (uncoupled) 
Darwin  system  is 


Vx 
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0 
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0 
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Er 

o  H 
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,9b 

0 

fE, 

0 

The  formal  normal  of  T>  is, 


V*V 


2  2 
eo/4) 

St 

£oMo.  v  x 


—  A  - 


St. 


Tvx 

ffiVx 


0 

Tt^x 

-A 


(3.49) 


(3.50) 


Introducing  a  left  block  scaling  of  diag 


I  J  <5£  j  St  j  St2  j  St 2  j 
’  ’  COMO  1  ’  epMo  1  ’  eoMo  1  ’  eoMo  1 


and  a  right 


scaling  of  diag  [/,  fully  eliminates  all  physical  constants  from  V.  These 

modifications  generate  the  new  system 


'  Vx  0  O' 

'  0  ' 

V-  0  0 

Er 

9Er 

-1  Vx  0 

B 

fe 

0  V-  0 

0 

0  1  Vx 

.  Es  . 

0  0  V- 

o 

where  the  transformed  unknowns  are  defined  as:  B  =  -^-B,  and  E,  = 

The  right-hand  side  terms  are  transformed  in  a  similar  way.  Identify  the  transformed 
Darwin  block  as  T>.  With  standard  boundary  conditions,  it  is  straightforward  to  prove 
that  this  system  is  H1  elliptic. 

To  see  this,  consider  the  formal  normal  of  V , 


V*V 


X- A  -Vx  0 
—V  x  X  —  A  Vx 
0  Vx  -A 
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Lev. 

P(P) 

piV) 

1 

0.56 

0.048 

2 

0.60 

0.044 

3 

0.54 

0.052 

4 

0.50 

0.049 

5 

0.53 

0.050 

6 

0.53 

0.077 

Table  3.1:  AMG  convergence  factors  for  both  the  original  Darwin  system  (3.49)  with 
a  time  step  of  St  =  and  the  scaled  Darwin  system  (3.51)  using  quadratic  finite 
elements.  Level  1  is  a  4  x  4  quadrilateral  mesh  and  is  refined  uniformly  6  times  using 
NI. 


This  system  is  differentially  diagonally  dominant,  plus  the  diagonal  blocks  dominate 
the  off  diagonal  blocks. 

To  demonstrate  how  algebraic  multigrid  (AMG)  applied  to  this  subsystem  will 
behave,  let  p(Q)  denote  the  asymptotic  convergence  factor  of  AMG  applied  to  a 
discrete  form  of  the  operator  Q.  The  value  of  p{Q)  is  computed  by  allowing  AMG 
V-cycles,  on  each  level  of  refinement  to  continue  until  a  stable  factor  is  reached.  The 
asymptotic  AMG  convergence  factors  for  both  V  and  T>,  discretized  using  FOSLS,  are 
seen  in  Table  3.1,  where  a  time  step  of  St  =  was  used  in  the  unsealed  block.  The 
scaled  Darwin  block  performs  significantly  better  than  the  unsealed  Darwin  block. 

3.4.4.  Fluid  Blocks.  The  fluid  blocks,  Aa,  are  not  naturally  T^-elliptic.  The 
root  cause  of  this  is  due  to  the  fact  that  there  is  no  curl  term  associated  with  the 
momentum  densities,  pa.  The  isolated  fluid  block  is 


ma 

St 

TQv  ' 

Pa 

V- 

1 

St  . 

.  9na 

This  system  is  modified  by  introducing  the  Curl  of  conservation  of  momentum  as  an 
additional  equation, 


ma 

St 

TaX7 
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(3.54) 


Denote  the  differential  matrix  above  by  Ca.  The  modified  system,  Ca,  can  easily  be 
shown  to  be  fully  T^-elliptic  with  appropriate  boundary  conditions.  The  additional 
Curl  constraint  forces  the  momentum  density  to  be  in  the  smoother  space,  H1.  This 
is  a  reasonable  requirement  to  enforce  on  momentum  density.  The  formal  normal  of 
Ca  is 


C*C  = 


st 2 


(Z-A  )  +  (# 

_  /  maTa _ 1_ 

V  St  St 


-  i)  vv- 
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( _  J_)  V7 
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(3.55) 


Notice  that  the  upper  left  block  may  become  dominated  by  the  —  VV-  term,  de¬ 
pending  on  the  size  of  the  physical  parameters.  Introducing  a  left  block  scaling  of 


diag 


and  a  right  block  scaling  of  diag 


y/rriccTc 


yields 
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Lev. 

P(Q 

p(Ci) 

p{Ce) 

PiCe) 

1 

0.47 

0.054 

0.87 

0.11 

2 

0.73 

0.084 

0.97 

0.13 

3 

0.76 

0.12 

0.99 

0.14 

4 

0.81 

0.13 

0.99 

0.16 

5 

0.87 

0.15 

- 

0.16 

6 

0.90 

0.14 

- 

0.17 

Table  3.2:  AMG  convergence  factors  for  both  the  original  fluid  system  (3.54)  and 
the  scaled  fluid  system  (3.56)  using  quadratic  finite  elements.  Level  1  is  a  4  x  4 
quadrilateral  mesh  refined  uniformly  6  times  using  NI. 


the  system 


y/rricc 

St^/TZ 
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y/ma 

Sty/T, ~ 

<a  and 

foot  —  Ta 

Pa 
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V  X  fpa 
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(3.56) 


The  right-hand  side  terms  are  transformed 
in  a  similar  way.  Define  the  differential  matrix  in  (3.56)  as  Ca.  The  formal  normal  is 


C*C  — 


TiSt2 


-A 


0 


T^t 2 


-A 


(3.57) 


The  off  diagonal  blocks  in  the  formal  normal  have  been  completely  eliminated,  pro¬ 
ducing  an  1  -elliptic  block  diagonal  system.  That  is,  multigrid  cycles  will  be  optimal 
for  (3.56)  when  discretized  using  FOSLS. 

The  asymptotic  AMG  convergence  factors  for  C,;,  C, ,  Ce ,  and  Ce  are  displayed  in 
Table  3.2.  A  time  step  of  St  =  was  used.  Both  the  scaled  ion  and  electron  fluid 

fDp  ^  e 

blocks  show  significant  improvement  over  their  unsealed  counterparts.  The  unsealed 
electron  fluid  block  benefits  the  most  from  scaling  due  to  the  stiffness  of  the  electron 
mass.  The  original  fluid  blocks,  Aa ,  are  not  77 1  elliptic  and  standard  AMG  performs 
poorly  on  their  discrete  systems. 

3.4.5.  Full  TFP  System.  Following  the  scalings  and  modifications  outlined  in 
Section  3.4.3  and  Section  3.4.4  through  the  entire  system,  (3.45)  yields 
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where  Ca  and  T>  are  defined  previously.  The  scaled  off  diagonal  blocks  are 
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and 
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(3.60) 

for  a  =  i,e.  The  unknowns  and  right-hand-side  functions  are  also  scaled  appropriately 
as  outlined  in  previous  sections. 

The  Freclret  derivative  used  in  the  Newton  step  becomes 
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(3.61) 


Ka  = 
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for  a  =  i,e,  and  open  parentheses  indicate  an  operation  to  be  taken  first. 

A  final  scaling  is  performed  on  this  system  in  order  to  balance  the  magnitude  of 
the  upper  and  lower  off-diagonal  blocks  without  altering  the  modifications  made  to 
the  block  diagonals  1 .  This  reduces  the  size  of  the  off-diagonal  terms  in  the  formal 
normal.  Let 


a  =  max 


£oMo  eoMo 
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The  final  system  involves  the  block  scaling 
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1  It  is  assumed  that  the  system  was  previously  nondimensionalized  and  that  all  constants  represent 
unitless  quantities. 
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3.4.6.  Uniform  TLl  Ellipticity.  An  important  property  of  the  operator  C'{ u) 
is  that  it  be  uniformly  coercive  and  continuous  in  a  convenient  norm  for  all  u  in 
some  neighborhood  of  the  exact  solution,  u*.  This  guarantees  the  existence  of  each 
Newton  step  once  the  approximation  is  sufficiently  accurate.  Below  it  is  shown  that, 
under  mild  hypotheses,  £'(u)  is  uniformly  "^-elliptic  in  a  neighborhood  of  the  exact 
solution.  Toward  that  end,  define  the  open  ball, 

Br(u,)  =  {u  £  V  :  || u  —  u*||wi  <  r}.  (3.62) 

It  was  shown  in  Section  3.4.4  that  C*  and  Ce  are  coercive  and  continuous  in  TL1.  Here, 
a  further  assumption  is  made  on  the  H1  ellipticity  of  the  block  diagonals,  C,;  +U'u  and 

Ce  +M'ee- 

Theorem  3.1.  Let  Ll  be  a  convex  domain  with  piecewise  C1,1  boundary.  Assume 
that  there  exist  r  >  0  and  positive  constants,  ca(r)  and  Ca(r),  such  that,  for  every 
u  £  Br( u*)  and  for  every  Sa  £  (fH1)4, 

ca{r)\\8a\\ni  <  \\{Ca  +U'oia{u))8ol\\  <  Ca{r)\\5a\\ni,  (3.63) 

for  a  =  i,e.  Further,  assume  that  C! (u)  is  injective  for  every  u  £  Br( u*).  Then, 
there  exist  positive  constants,  c(r)  and  C(r),  such  that,  for  every  u  £  Rr(u*)  and  for 
every  6  £  (H1)17 , 

c(r)fc1<||£'(u)[J]||<C'(r)||5||wl.  (3.64) 


Proof.  The  upper  bound  in  (3.64)  is  easily  obtained  and,  for  the  sake  of  brevity, 
the  proof  of  the  lower  bound  is  only  outlined.  The  ellipticity  of  V  was  established 
in  3.4.3,  with  some  constants,  say  cD,CD  >  0.  Let  u  €  Br( u*).  By  assumption 
(3.63),  the  operator  with  only  the  block  diagonals  is  TL1  elliptic  with  constants  cB  = 
min[cQ(r),cD],  CB  =  nra x[Ca,CD\. 

It  is  clear  that  the  upper  blocks  satisfy 

\\UaJD\\  <  C'aJ'U*!  <  CaJcD\\VSDl  (3.65) 


for  some  constants  CaD( u)  >  0,  for  a  =  i,e.  Now,  consider  the  system  with  the 
strictly  lower-triangular  blocks  removed.  It  is  straight  forward  to  establish 
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with  r]  =  (1  +  max[COD\/ cn).  This  establishes  the  lower  bound  for  the  upper  block 
triangular  part  of  the  system.  The  proof  is  completed  by  noting  that  Ca  contain  only 
zeroth  order  terms  and  by  applying  a  standard  compactness  argument.  □ 

Remark  3.1.  It  should  be  clearly  asserted  that  the  assumption  on  the  Tl1- 
ellipticity  of  (Ca  +U'aa(u))  has  not  yet  been  rigorously  validated.  The  numerical  tests 
in  Section  3-4-7  provide  a  strong  indication  that  the  above  assumption  is  plausible.  A 
more  thorough  examination  of  ellipticity  is  currently  being  explored  by  the  authors. 

Full  7-^ 1  -ellipticity  implies  that  standard,  'H1-conforming,  finite  element  spaces 
can  be  used  and  standard  convergence  bounds  apply.  In  the  linear  case,  enhanced  L2 
convergence  is  also  ensured  [43].  That  is,  the  L2-norm  of  the  error  will  converge  one 
order  faster  than  the  functional  norm  of  the  error.  This  is  observed  in  the  numerical 
tests  below. 
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3.4.7.  Numerical  Tests.  In  this  section  the  performance  results  of  the  FOSLS- 
TFP  model,  solved  with  NI,  are  presented.  No  time  stepping  algorithm  is  imple¬ 
mented.  Instead,  focus  is  given  to  the  system  generated  for  a  fixed  time  step  by 
manufacturing  solutions  and  analyzing  the  AMG  solver  performance  and  FOSLS  con¬ 
vergence  properties.  Two  different  problems  are  analyzed.  In  the  first,  a  simple 
smooth  solution  is  constructed  from  sines  and  cosines.  In  the  second,  a  solution  is 
manufactured  with  a  sharp  gradient  in  the  2-component  of  momentum  density  and  a 
large  circulation  in  the  x  and  y  components  of  the  magnetic  field.  Large  features  in 
the  fluid  variables  produce  substantial  nonlinearities  in  off  diagonal  components,  as 
appear  in  (3.60).  It  is  noted  that  neither  of  these  manufactured  solutions  are  physical, 
but  instead,  are  used  to  demonstrate  how  the  nonlinear  solver,  nested-iteration,  and 
AMG  perform. 

The  problems  are  solved  on  the  computational  domain  =  [0,1]  x  [0,1].  The 
system  is  nondimensionalized  against  ion  mass,  rrii ,  fundamental  charge,  e,  Debye 
length  multiples,  £Xd,  ion  number  density,  no,i,  and  the  electron  thermal  velocity,  Vth,e 
(Note:  £  specifies  the  number  of  Debye  lengths  the  computational  domain  represents). 
For  each  simulation,  the  ratios  and  £  must  be  specified.  Unless 

7  rrii  7  vth,e  7  c  7  ^ 

otherwise  denoted,  take  ^  =  5.44  x  1CT4,  =  10-2,  =  10-2,  and  £  =  100. 

"‘-i  ^th,e  C 

In  all  tests  a  time  step  of  St  =  10 -^JL-  is  used. 

The  coarsest  problem,  Level  0,  is  a  4  x  4  mesh  discretized  with  bilinear  finite 
elements.  A  random  initial  guess  is  made  and  Newton-FOSLS  is  performed  until  a 
desired  tolerance  is  reached.  As  will  be  demonstrated,  the  work  performed  on  Level  0 
is  negligible.  Using  p- refinement,  the  initial  guess  for  Level  1  is  constructed  from  the 
approximate  solution  on  Level  0,  and  standard  NI-Newton-FOSLS  is  continued  until 
the  finest  level  is  reached. 

3.4.8.  Baseline  Test.  In  this  test  a  simple  smooth  solution  is  constructed  from 
sines  and  cosines.  The  number  densities  are  taken  to  be  small  perturbations  away  from 
1  and  most  of  the  momentum  and  field  quantities  are  taken  to  be  small  perturbations 
away  from  0.  The  exception  is  the  quantities  pjiZ,  pe,z ■>  Bx,  and  By,  which  are  0(1). 
Prescribe  the  exact  solution  to  be: 


Pi,x,y 

= 

sin  (-kx)  sin  (iry)  , 

Pi,z 

= 

sin  (7ra:)  sin  (ity)  , 

rii 

= 

1  +  yq  cos  ( 2ttx )  cos  (27 r; 

Pe,x,y 

= 

~To  sin(7ra;)sin(7n/)  , 

Pe,z 

= 

—  sin  (7ra;)  sin  [iry)  , 

ne 

= 

1  +  Yg  cos  (27ra)  cos  (27T 

E 

^r^x 

= 

^  cos  (27ra;)  sin  (ny)  , 

E 

J-/r,y 

= 

sin  (ttx)  cos  (2iry)  , 

Er,z 

= 

^  sin  (t tx)  sin  (tt y)  , 

Bx 

= 

sin  (7ra:)  cos  (27 xy)  , 

By 

= 

cos  (27ra;)  sin  (ny)  , 

Bz 

= 

cos  (27ra;)  cos  (27 xy)  , 

Es,x 

= 

^  cos  (27ra;)  sin  (ny)  , 

Es,y 

= 

^  sin  (nx)  cos  (2ny)  , 

Es,z 

= 

Yfi  sin  (nx)  sin  (ny)  . 

The  asymptotic  AMG  convergence  factors,  p ,  for  each  level  are  seen  in  Table  3.3.  The 
factors  are  computed  by  setting  the  AMG  solver  tolerance  to  1CU6,  effectively  allowing 
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Lev. 

pi(£) 

P2(£) 

1 

0.51 

0.57 

2 

0.57 

0.62 

3 

0.49 

0.48 

4 

0.49 

0.52 

5 

0.53 

0.59 

6 

0.59 

— 

Table  3.3:  Problem  (3.66):  Asymptotic  AMG  convergence  factors  for  the  final  scaled 
nonlinear  TFP-Darwin  system.  A  relative  AMG  Solver  tolerance  of  10-6  was  used. 
Level  1  is  a  4  x  4  quadrilateral  meshed  discretized  with  quadratic  elements,  pi  denotes 
the  factor  produced  for  Newton  iteration  i. 


AMG  V-cycles  to  continue  until  a  stable  convergence  factor  is  reached.  This  is  done 
independently  from  the  NI-Newton-FOSLS  computations.  Several  Newton  iterations 
are  performed  on  each  level,  each  with  it’s  own  asymptotic  factor.  Denote  pi  as  the 
asymptotic  convergence  for  each  Newton  iteration  i.  The  factors  hold  between  0.50  - 
0.60.  A  substantial  quantity  of  Newton  iterations  are  performed  on  Level  0,  and  thus, 
not  included  in  the  table.  Averaging  over  all  V-cycles  for  every  Newton  iteration  on 
Level  0  gives  an  average  asymptotic  convergence  factor  of  0.35. 

The  normalized  nonlinear  FOSLS  functional  values,  ||£(u)  —  /||o,  are  seen  in  Fig. 
3.7,  where  Uq  represents  the  value  before  Newton-FOSLS  has  been  performed  and  uf 
represents  the  value  after  convergence  is  achieved.  The  values  are  normalized  such 
that  the  initial  FOSLS  functional  on  Level  1  has  a  value  of  1.  Initially,  the  functional  is 
decreasing  at  a  rate  of  nearly  0(/j3),  but  as  refinement  continues  it  begins  to  approach 
0(h2).  The  L2-error  of  the  solution  after  each  Newton-FOSLS  process  is  plotted  in 
Fig.  3.8.  The  values  are  normalized  such  that  the  the  initial  L2-error  on  Level  1  has 
a  value  of  1.  The  error  is  decreasing  at  a  rate  of  0(h3).  The  rate  of  i2-convergence  is 
one  order  higher  than  that  of  the  FOSLS  functional,  as  expected,  using  the  well-known 
Aubin-Nitsche  duality  argument  [43]. 

The  benefit  of  NI  is  well  demonstrated  in  Table.  3.4.  Define  a  Work  Unit  (WU) 
to  be  the  cost  of  one  matrix-vector  multiplication  on  the  finest  level.  It  takes  a  total 
of  20.64  WU  in  order  to  bring  the  solution  through  the  coarse  grids  and  solve  the 
nonlinear  problem  on  Level  6.  The  8  Newton  iterations  performed  on  Level  0  amounts 
to  only  0.05%  of  the  total  work.  Notice  that  the  number  of  Newton  iterations  required 
on  each  successive  level  decreases  such  that  by  the  time  the  finest  grid  is  reached  only 
1  Newton  iteration,  using  3  AMG  V-cycles,  is  required.  In  this  test,  the  components  of 
the  solution  are  smooth  and,  in  turn,  keep  the  off  diagonal  couplings  small.  Another 
test  problem  is  used  to  demonstrate  how  the  Newton-FOSLS-NI  approach  performs 
on  problems  with  sharper  features  (ie.  steep  gradients). 


3.4.9.  Sharp  Current  Density  Test.  In  this  test  all  components  of  the  solu¬ 
tion  remain  the  same  as  in  problem  (3.66)  except  for  Pi}Z,  pe,zi  Bx,  and  By.  For  these 
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Fig.  3.7:  Problem  (3.66):  The  initial  and  final  nonlinear  FOSLS  functionals  through 
6  levels  of  NI  with  uniform  refinement.  The  tolerance  of  the  AMG  Solver  is  10-1 
and  the  tolerance  of  the  Newton  iteration  is  10-2.  The  value  of  h  is  defined  as 


where  Ne  is  the  number  of  elements.  The  values  are  normalized  such  that  the  initial, 
nonlinear  FOSLS  functional  on  Level  1  has  a  value  of  1. 


Lev. 

Newton  Iters. 

Tot.  V- Cycles 

NNZ 

WU 

0 

8 

39 

4.3  x  104 

0.01 

1 

3 

11 

2.8  x  105 

0.02 

2 

3 

13 

1.1  x  10B 

0.13 

3 

3 

12 

4.3  x  106 

0.61 

4 

3 

13 

1.7  x  107 

2.72 

5 

2 

9 

6.8  x  107 

7.83 

6 

1 

3 

2.7  x  10s 

9.31 

total 

— 

— 

— 

20.64 

Table  3.4:  Problem  (3.66):  The  number  of  Newton  Iterations  (Newton  Iters.),  Total 
number  of  AMG  V-cycles  (Tot.  V-Cycles),  number  of  nonzeros  in  the  linear  operator 
(NNZ),  and  Work  Units  (WU),  on  each  level  of  the  NI  process.  A  WU  is  defined  as 
the  cost  of  one  matrix-vector  multiplication  on  the  finest  level.  The  tolerance  of  the 
a  AMG  Solver  is  10-1  and  the  tolerance  of  the  Newton  iteration  is  10~2. 


unknowns  define: 

Pi,z  =  sin(Tra)  sin(Try)  exp  (-  (x^°25)  )  exp  (-  {v ^°25)  )  , 

Pe,z  =  -  sin(7ra:)  sm(ny)  exp  (-  [x ~°25)  )  exp  (-  ~°25)  )  , 

Bx  =  ~{y  -  0.5)  sin(7ra:)  exp  {x ~°25)  'j  exp  )  , 

By  =  (. x  -  0.5)  sin(? vy)  exp  (~^^5)  )  exp  (-  (y~°25)  )  , 


(3.67) 
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Fig.  3.8:  Problem  (3.66):  The  final  L2-error  on  each  level,  through  6  levels  of  NI  with 
uniform  refinement.  The  tolerance  of  the  AMG  Solver  is  10_  1  and  the  tolerance  of 
the  Newton  iteration  is  10-2.  The  solutions  are  normalized  such  that  the  error  on  the 
Level  1  is  1.  Here  u  represents  the  interpolant  of  the  exact  solution.  The  value  of  h 
is  defined  as  -A-  where  Ne  is  the  number  of  elements. 


Lev. 

Pi(C) 

P2(C) 

1 

0.54 

0.55 

2 

0.59 

.60 

3 

0.45 

— 

4 

0.47 

— 

5 

0.55 

— 

6 

0.54 

— 

Table  3.5:  Problem  (3.67):  Asymptotic  AMG  convergence  factors  for  the  final  scaled 
nonlinear  TFP-Darwin  system.  A  relative  AMG  Solver  tolerance  of  10-6  was  used. 
Level  1  is  a  4  x  4  quadrilateral  meshed  discretized  with  quadratic  elements,  pi  denotes 
the  factor  produced  for  Newton  iteration  i. 


with  a  value  of  a  =  0.02.  The  solution  contains  a  current  density  (jz  =  qiPitZ  +QePe,z ) 
in  the  ^-direction  with  a  steep  gradient  and  a  corresponding  strong  circulation  in  the 
(x,  j/)-components  of  magnetic  field.  A  sketch  of  these  components  are  visualized  in 
Fig.  3.9. 

The  asymptotic  convergence  factors  are  found  in  Table  3.5.  The  factors  remain 
mostly  unchanged  from  problem  (3.67),  living  in  the  range  of  0.5  -  0.6.  Level  0  is  not 
included  in  Table  3.5  because  a  large  number  of  Newton  iterations  were  performed. 
The  average  AMG  convergence  factor  over  all  V-cycles  on  Level  0  is  0.45.  A  slightly 
tighter  relative  tolerance  of  10-2  was  required  for  the  AMG  solver  in  order  to  keep 
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U 


Fig.  3.9:  Problem  (3.67):  A  sketch  of  the  ^-component  of  current  density  and  (x,y)- 
components  of  magnetic  field. 


the  number  of  Newton  iterations  low  on  the  finest  mesh.  The  normalized  nonlinear 
FOSLS  functional  values,  || C{u)  —  /||0,  are  seen  in  Fig.  3.10,  with  u0  and  Uf  defined 
previously.  For  the  coarser  grids,  the  FOSLS  functional  convergence  is  poor.  This  is 
likely  due  to  the  fact  that  the  features  present  in  the  current  density  and  magnetic 
fields  are  smaller  than  the  resolution  of  the  grid.  After  3  levels  of  ^.-refinement  (1024 
elements)  the  functional  approaches  the  desired  0(h2)  behavior.  The  L2-error  is  seen 
in  Fig.  3.11.  In  a  similar  way  to  the  nonlinear  FOSLS  functional,  the  L2-error  does 
not  begin  to  achieve  the  desired  0(h 3)  convergence  until  4  levels  of  ^.-refinement  are 
performed.  Further  refinement  would  reveal  a  more  conclusive  0(h3)  convergence 
trend.  A  summary  of  the  required  WUs  can  be  found  in  Table  3.6.  Again,  more 
nonlinear  iterations  are  performed  on  the  coarse  grids  where  V-cycles  and  Newton 
linearization  are  cheap.  By  the  time  that  the  finest  level  is  reached,  only  2  Newton 
iterations  and  a  total  of  9  AMG  V-cycles  were  required. 


24 


DISTRIBUTION  A:  Distribution  approved  for  public  release 


Number  of  Elements 


Fig.  3.10:  Problem  (3.67):  The  initial  and  final  nonlinear  FOSLS  functionals  through 
6  levels  of  NI  with  uniform  refinement.  The  tolerance  of  the  a  AMG  Solver  is  10-2 
and  the  tolerance  of  the  Newton  iteration  is  10“  2.  The  value  of  h  is  defined  as  -^=, 
where  Ne  is  the  number  of  elements.  The  values  are  normalized  such  that  the  initial 
nonlinear  FOSLS  functional  on  Level  1  has  a  value  of  1. 


Lev. 

Newton  Iters. 

Tot.  V- Cycles 

NNZ 

WU 

0 

7 

34 

4.3  x  104 

0.01 

1 

3 

11 

2.8  x  105 

0.02 

2 

4 

17 

1.1  x  106 

0.17 

3 

1 

3 

4.3  x  106 

0.13 

4 

2 

7 

1.7  x  107 

1.38 

5 

1 

3 

6.8  x  107 

2.21 

6 

2 

9 

2.7  x  10s 

31.33 

total 

— 

— 

— 

35.27 

Table  3.6:  Problem  (3.67):  The  number  of  Newton  Iterations  (Newton  Iters.),  Total 
number  of  AMG  V-cycles  (Tot.  V-Cycles),  number  of  nonzeros  in  the  linear  operator 
(NNZ),  and  Work  Units  (WU),  on  each  level  of  the  NI  process.  A  WU  is  defined  as 
the  cost  of  a  matrix- vector  multiplication  on  the  finest  level.  The  tolerance  of  the  a 
AMG  Solver  is  10~2  and  the  tolerance  of  the  Newton  iteration  is  10-2. 
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Fig.  3.11:  Problem  (3.67):  The  final  L2-error  on  each  level  through  6  levels  of  NI  with 
uniform  refinement.  The  tolerance  of  the  a  AMG  Solver  is  10-2  and  the  tolerance  of 
the  Newton  iteration  is  10~2.  The  solutions  are  normalized  such  that  the  error  on 
the  Level  0  is  1.  Here  u  represents  the  interpolant  of  the  exact  solution.  The  value  of 
h  is  defined  as  where  Ne  is  the  number  of  elements. 
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4.  Space-Time  Parallelization.  The  need  for  parallelism  in  time  is  being 
driven  by  changes  in  computer  architectures,  where  future  speed-ups  will  be  avail¬ 
able  through  greater  concurrency,  not  faster  clock  speeds.  This  leads  to  a  bottleneck 
for  sequential  time  marching  schemes  because  they  lack  parallelism  in  the  time  di¬ 
mension.  Multigrid  Reduction  in  Time  (MGRIT  [44,  45])  is  an  iterative  procedure 
that  allows  for  temporal  parallelism  by  utilizing  multigrid  reduction  techniques  and  a 
multilevel  hierarchy  of  coarse  time  grids.  The  goal  of  this  work  is  the  efficient  solution 
of  nonlinear  problems  with  MGRIT,  where  efficiency  is  defined  as  achieving  similar 
performance  when  compared  to  an  equivalent  linear  problem.  When  solving  a  linear 
problem  using  implicit  methods  and  optimal  spatial  solvers  (e.g.  classical  multigrid), 
the  spatial  multigrid  convergence  rate  is  fixed  across  temporal  levels,  despite  a  large 
variation  in  time  step  sizes.  This  is  not  the  case  for  nonlinear  problems,  where  the 
work  required  increases  dramatically  on  coarser  time  grids.  By  using  a  variety  of 
strategies,  most  importantly,  spatial  coarsening  and  an  alternate  initial  guess  for  the 
nonlinear  solver,  it  is  possible  to  reduce  the  work  per  time  step  evaluation  over  all 
temporal  levels  to  a  range  similar  to  those  of  a  corresponding  linear  problem.  This  al¬ 
lows  for  overall  speedups  comparable  with  those  achieved,  in  previous  work,  for  linear 
systems. 

Previously,  ever  increasing  clock  speeds  allowed  for  the  speedup  of  sequential  time 
integration  simulations  of  a  fixed  size,  and  for  stable  wall  clock  times  for  simulations 
that  were  refined  in  space  (and  usually  time).  However,  clock  speeds  are  now  almost 
stagnant,  leading  to  the  sequential  time  integration  bottleneck.  By  allowing  for  par¬ 
allelism  in  time,  much  greater  computational  resources  can  be  brought  to  bear,  and 
overall  speedups  can  be  achieved.  Because  of  this,  interest  in  parallel-in-time  methods 
has  grown  over  the  last  decade. 

Work  on  parallel-in-time  methods  actually  goes  back  at  least  50  years  (c.f.  [46] 
and  the  review  in  [47]),  but  most  of  the  historical  efforts  were  limited  to  single-  or  two- 
level  approaches.  This  project’s  focus  was  on  MGRIT,  a  true  multilevel  algorithm  with 
optimal  scaling  in  terms  of  both  parallel  communication  and  number  of  operations. 
Note  that  Parareal  [48],  perhaps  the  most  well  known  parallel-in-time  method,  is 
equivalent  [49]  to  a  two-level  multigrid  scheme. 

Sequential  time  marching  schemes  are  optimal  in  that  they  move  from  the  initial 
time  to  the  final  time  with  minimal  computational  cost.  By  applying  time  stepping  on 
various  levels  as  MGRIT  does,  some  computational  cost  is  traded  for  efficient  temporal 
concurrency.  This  efficiency  was  demonstrated  in  a  strong  scaling  study  of  MGRIT 
for  linear  diffusion  on  a  (257)2  x  16385  space-time  grid.  The  space-time  parallel  runs 
used  an  8  x  8  processor  grid  in  space,  with  all  additional  processors  being  added  in 
time.  MGRIT  showed  substantially  better  performance  that  sequential  time  stepping 
when  there  were  more  than  just  a  thousand  total  cores.  These  strong  scaling  also 
showed  significantly  increased  benefits  of  MGRIT  of  up  to  a  factor  of  ten  speedup  over 
sequential  time  stepping  in  important  practical  parameter  regimes.  Moreover,  weak 
scaling  results  proved  MGRIT  to  be  scalable  for  nonlinear  problems,  with  iteration 
counts  bounded  independently  of  problem  size. 

5.  Fault  Resilience.  As  the  complexity  of  high  performance  computing  (HPC) 
systems  continues  to  progress,  constraints  on  system  design  force  the  handling  of  errors 
to  higher  levels  in  the  software  stack.  Checkpoint  restart  is  the  predominate  way  to 
recover  from  fail-stop  errors  (errors  that  cause  unexpected  application  termination 
such  as  power  outage,  node  reboot,  and  software  crash).  Current  approaches  exploit 
the  complex  memory  hierarchies  of  HPC  systems  and  are  application  tailored  for 
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optimal  performance.  HPC  faces  a  potentially  more  harmful  class  of  errors  through  so- 
called  bit-flips  that  are  caused  by  a  number  of  external  affects  such  as  energetic  particle 
strikes.  These  are  regarded  as  particularly  harmful  as  they  are  unexpected  by  the 
application  and  may  have  silent,  unknown  impact  on  the  simulation  results.  This  silent 
data  corruption  (SDC)  can  be  masked,  requiring  extra  time  for  a  correct  simulation, 
yielding  invalid  results,  or  causing  application  failure.  Due  to  the  costly  side  effects, 
applications  are  employing  lightweight  SDC  detectors  to  verify  the  correctness  of  a 
simulation.  For  example,  Figure  5.1  shows  different  bits  corrupted  in  the  residual 
calculation  of  a  multigrid  cycle,  leading  to  a  range  of  results,  from  very  little  impact, 
to  a  significant  increase  in  the  time-to-solution.  ‘ 


Fig.  5.1:  Residual  history  for  different,  corrupted  bits. 


Emerging  and  next  generation  architectures  are  expected  to  introduce  unprece¬ 
dented  levels  of  faults,  requiring  new  approaches  to  fault  resilience  and  algorithm 
development.  One  difficult  aspect  in  preparing  algorithms  and  application  codes  for 
environments  with  increased  levels  of  SDC  is  testing.  To  facilitate  better  analysis  of 
high-performance  scientific  libraries,  a  fault  injection  and  analysis  tool  called  Fliplt 
was  created  [50].  Fliplt  uses  LLVM  to  expose  intermediate  IR  code.  This  IR  code 
is  used  to  create  faults  in  the  program  logic,  thus  allowing  for  general  fault  models. 
An  example  is  shown  in  Figure  5.2  where  arithmetic  is  corrupted  through  through 
the  transformed  code.  Indeed,  in  contrast  to  previous  research  that  focused  on  faults 
arising  in  heavily  protected  system  memory,  this  package  simulates  corruption  from 
processor  flow,  including  pointer,  arithmetic,  and  control  logic. 

The  Fliplt  fault  injection  framework  led  to  the  analysis  of  several  computing 
kernels,  including  the  sparse  matrix- vector  product  [51].  In  addition,  a  full  fault 
analysis  and  recovery  algorithm  was  introduced  in  a  multigrid  setting  [52] ,  where  it  was 
shown  that  the  multigrid  algorithm  may  recover  from  SDC  by  incorporating  several 
detection  strategies.  For  example,  an  energy  estimate  is  used  within  the  algorithm 
to  identify  errors  that  may  not  lead  to  a  segmentation  fault,  potentially  resulting  in 
incorrect  or  lengthy  computations.  In  addition,  simple  loop  checks  were  explored  as  a 
mechanism  for  pointer  logic  that  avoids  expensive  triplication  in  the  code.  Finally,  the 
approach  introduced  a  strategy  to  recover  from  segmentation  faults,  thus  avoiding  an 
expensive  restart  if  computation  could  proceed  normally.  The  cost  of  incorporating 
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define  i32  <3add(i32  7.a,  i32  7.b)  #0  { 
entry : 

7»add  =  add  nsw  i32  7»a,  7«b 


define  i32  @add(i32  7.a,  i32  7.b)  #0  { 
entry : 

7«add  =  add  nsw  i32  7»a,  %b 
ret  i32  7»add 


(a)  Original  LLVM  IR. 


7odata  =  sext  i32  7«add  i64 

7»tmp  =  call  i32  @crptlnt(i32  0,  i32  0, 

double  0.01,  i32  2,  i64  7.data) 
7»crptAdd  =  trunc  i64  7«tmp  to  i32 
ret  i32  %crptAdd  } 

(b)  Transformed  LLVM  IR. 


Fig.  5.2:  Intermediate  code  for  fault  intjection. 


these  strategies  is  shown  in  Figure  5.3  for  the  case  of  multigrid.  Even  in  parallel  the 
overhead  is  modest,  while  adding  a  significant  amount  of  protection  to  the  algorithm, 
leading  to  nearly  99%  successful  detection.  A  clear  benefit  of  the  approach  is  that  it 
is  applicable  to  a  wide  range  of  computations.  ‘ 


Fig.  5.3:  Cost  of  fault  detection  versus  processor  count. 


Understanding  how  (silent)  errors  propagate  through  scientific  codes  is  critical, 
particularly  in  a  high-performance  setting.  The  fault  analysis  revealed  fault  patterns 
across  codes  that  suggest  common  mitigation  strategies  [53].  For  example,  over  a 
large  sample  of  simulations,  faults  emerge  in  pointer  logic  around  40%  of  the  time, 
arithmetic  around  40%,  and  control  logic  around  20%.  The  impact  of  data  corruption 
on  other  parts  of  the  application  is  also  important.  More  recently,  the  injection 
approach  is  being  used  to  to  study  a  broader  range  of  methods  and  applications 
such  as  multigrid,  FFT,  particle  methods,  plasma  computations,  and  turbulent  fluid 
flow  [54],  The  approach  is  leading  to  visualization  tools  to  help  aid  developers  in 
identifying  sensitive  portions  of  their  code. 

The  fault  injection  framework  and  application  analysis  has  also  led  to  a  new  pro¬ 
cess  for  checkpointing  use  application-level  decisions.  Checkpointing  is  critical  for 
recovery  from  hard  faults  such  as  node  failure.  However,  lightweight  checkpointing  is 
emerging  as  a  critical  tool  for  recovery  when  silent  errors  are  detected  in  the  simula¬ 
tion.  The  fault  analysis  is  leading  to  the  construction  of  a  lossy  checkpoint  strategy 
wherein  simulation  states  are  checkpointed  in-memory.  This  introduces  a  small  error 
in  comparison  to  the  large  decrease  in  memory  (and  computational)  footprint. 
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